arXiv:1509.07238v1 [cs.SE] 24 Sep 2015


Frequency Distribution of Error Messages 


David Pritchard 

Center for Education in Math and Computing, University of Waterloo, Canada * 

dagpritchard@uwaterloo.ca 


Abstract 

Which programming error messages are the most common?
We investigate this question, motivated by writing error
explanations for novices. We consider large data sets in Python
and Java that include both syntax and run-time errors. In
both data sets, after grouping essentially identical messages,
the error message frequencies empirically resemble
Zipf-Mandelbrot distributions. We use a maximum-likelihood
approach to fit the distribution parameters. This gives one
possible way to contrast languages or compilers quantitatively.

Categories and Subject Descriptors D.3.4. [Programming 
Language Processors]: Compilers, Run-time environments 

Keywords Error messages, empirical analysis, usability, 
education. 

1. Introduction 

This work started as an offshoot of Computer Science Circles
(CS Circles) [33], a website with 30 lessons and 100
exercises teaching introductory programming in Python. It
contains a system where students can ask for help if they
are stuck on a programming exercise. Often, students
reported being stuck because they could not comprehend an
error message, asking for a better explanation of what the
compiler/runtime was trying to say. E.g., the message

SyntaxError: can't assign to function call 

might not be understood by a novice who wrote sqrt(y)=x.

Motivated by this, we decided to systematically improve
the error messages that students received. There is copious
literature on writing good error messages [?, 40], but how
can this advice be incorporated into the programming
ecosystem? One approach would be making upstream
improvements to the compiler/runtime, but this can take a
long time, and not all audiences would appreciate the
changes that would most benefit novices. A second approach
would be to write a tool that analyzes code from scratch,
looking for common syntactic bugs or likely semantic
mistakes. The literature includes many such tools: see
checkstyle, findbugs and [10, 15, ?, 23, ?].

We chose a more lightweight approach: augmenting the
normal error messages with additional explanations. To wit,
we compile and execute the code as usual, and then add a
beginner-appropriate elaboration of the resulting error
message, implemented by rendering the normal error with a
clickable pop-up link to the explanation. This
augmenting-explanation approach has been previously used on a
small scale with Java compiler errors [?, §5.2], Python runtime
errors [11, §5.2.1], and C++ STL compiler errors [?].

It has long been observed that “a few types of errors
account for most occurrences” [35]; see also [?]. In order to
make sure that a small number of explanations would be
useful as often as possible, we had to answer the following
question: what error messages are the most common?
Counting error message frequencies has a long history, starting
from assembler [?] and SP/k [?], with renewed interest
more recently, using much larger data sets [?].

Using the history of all previous submissions, we
determined the essentially distinct error messages and their
frequencies (see Section 2), available online at
http://daveagp.github.io/errors. We wrote explanations for
the 36 most common messages. Regular expressions were
used to aid the implementation. At the most basic level,
some errors were made more readable by elaborating them
into a full paragraph of text rather than a one-line message.
Some explanations include concrete examples of code that
causes the same error message, and a description of how to
fix it. See [14, 25, ?, 29, 39, 40] for advice on writing error
messages. Given a large data set, the work involved in this
group-and-explain approach is modest and not technically
challenging, so we would recommend it in any
beginner-facing system. Moreover, in internationalized settings,
one can then add explanations in other languages (this has been
implemented in CS Circles’ Lithuanian translation).



This paper compares and contrasts the most common
error messages in CS Circles with those in another
programming language. The Blackbox project [?] is a large-scale
data collection-and-sharing project using BlueJ, a Java
programming environment oriented at beginners. We obtained
the error messages from all recorded compilation and
execution events, grouping and counting the essentially different
messages like we did for the Python data set. Comparing the
two data sets, we found that both error message frequency
distributions resembled the same family of distributions, the
Zipf-Mandelbrot distribution [?]. For these data sets, this
means that for any integer $k$, the frequency of the $k$th most
common error is approximately proportional to $1/(k+t)^\gamma$,
where $t$ and $\gamma$ are parameters of the data set. In order to
determine the best values for these parameters, we propose
using a simple maximum-likelihood approach.

1.1 Discussion and Other Related Work 

Orthogonal to purely quantitative analysis, a large body of
work focuses on manual categorization of errors. This allows
researchers to get more accurate results, and to precisely
understand the psychological state of the user, rather than
focus on the compiler-generated error messages themselves.
Good reasons for doing this include that “A single error may,
in different context, produce different diagnostic messages”
and that “The same diagnostic message may be produced by
entirely different and distinct errors,” see [?]. This analysis
also helps measure whether a compiler’s error messages are
appropriate (e.g., see [?]). This analysis is important
for compiler designers, language designers and educational
researchers, but it is not our focus.

The comparison of error message frequencies between
different languages raises many interesting open-ended
questions. Even within the same language, some compilers
are significantly better or worse than others; see Brown’s
amusing crowdsourcing of Pascal error messages [?] as
well as [?]. One way to view different error message
distributions is to imagine the extremes: the worst possible
language would only ever say “?” without elaborating
(this has been formally evaluated, see [39]), while the best
possible language would, like a human tutor, always give a
perfectly adapted explanation. The exponent $\gamma$ in our work
is one way to measure where a language sits between these
extremes. However, simpler measures such as entropy could
also be used. Also, a single quantitative measure should not
be treated as paramount without context. When comparing
languages/compilers (e.g., [?]), statistical fitness is less
important than overall usability, including measures like time
between errors and time to achieve user goals.

A notable alternative approach to improving student
feedback based on large-scale data, rather than focusing on error
messages, is the HelpMeOut system [?], which uses a
detailed repository of past student work sessions to find old
errors similar to new ones and make suggestions of how to
fix them.


To our knowledge, this paper is the first one to examine
any link between programming error messages and
statistical distributions. The special case $t = 0$ of the
Zipf-Mandelbrot distribution is known as the power law
distribution. It arises empirically in data sets such as the frequency
of distinct words in books, of links in webpages, and of
citations in literature. Caveats apply here [5, 11, 30],
including: that generative explanations of how these distributions
could arise are tenuous; that near-power-law data sets may
be even closer still to other distributions; and that analyzing
such data sets has common pitfalls like using linear
regression. Another caveat for our work is that the distributions of
error messages will depend on the nature of the users, and
the kind of setting in which the work is collected. In our
case, both data sets come from a very large, open project
intended for beginners. We anticipate that a data set where
students only work on a fixed set of exercises could be skewed
in some way, but both BlueJ and the CS Circles “console”
allow students to do any sort of open-ended programming.

See [?] for discussion of power laws in runtime
object-reference graphs of industry-scale computer programs.

2. Data Sets 

Our first data set is the Python corpus from CS Circles. 
Amongst the first 1.6 million code submissions, about 
640000 resulted in an error. Our second data set is the Java 
corpus from BlueJ Blackbox. We specifically considered 
the “compile” events, of which there were about 8 million, 
half of which produced an error, and the “invoke” events, of 
which there were about 5 million, about 260000 of which 
produced a syntax error and 180000 of which produced a 
run-time error. We did not include the codepad or unit test 
events, both of which are an order of magnitude smaller. 

In both cases, following [?], we only counted the first
error message. This tends to be the most accurate error (since
a syntax error can cause valid parts of the program further
down to be reported as new errors) and it is also the error that
the programmer is most likely to pay attention to and fix first.
Moreover, CS Circles only shows the first error message in
its user interface; and even for a UI like BlueJ that shows
multiple errors, beginner students often (by habit or by
instruction) fix only one at a time and then recompile/re-run.*

After obtaining these raw data sets of hundreds of
thousands of error messages, we had to count how many times
each distinct message occurred. It is necessary to “sanitize”
the data by removing parts that pertained to specifics of user
code rather than the kind of error. For instance,
NameError: name 'x' is not defined should be understood by our
system to be essentially the same error as
NameError: name 'sum' is not defined, so that the same
explanation will appear in either case. The sanitization was an
iterative process. Simple heuristics handled most cases correctly,
and in total we needed about 20 sanitization rules for Python
and 50 for Java, implemented using regular expressions.

* It would not be invalid to investigate data sets where all errors are reported
and counted, but a worry is that it might say more about the statistics of
chain effects in syntax errors and less about the actual underlying bugs. Two
other strategies, “count-all” and “count-distinct,” are used in [38], though
their study participants were professionals and not novices.
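
The sanitize-and-count step described above could be sketched roughly as follows. This is only an illustration: the two rules and the helper names `sanitize` and `count_errors` are hypothetical, since the actual rules used for CS Circles and Blackbox are not listed here.

```python
import re
from collections import Counter

# Hypothetical sanitization rules (pattern, replacement). The real rule sets
# (about 20 for Python, 50 for Java) are not given in the text.
SANITIZE_RULES = [
    (re.compile(r"name '[^']*' is not defined"), "name 'NAME' is not defined"),
    (re.compile(r"\bline \d+\b"), "line N"),
]

def sanitize(message):
    """Replace user-specific fragments so equivalent errors group together."""
    for pattern, replacement in SANITIZE_RULES:
        message = pattern.sub(replacement, message)
    return message

def count_errors(messages):
    """Return a Counter mapping each sanitized message to its frequency."""
    return Counter(sanitize(m) for m in messages)

raw = [
    "NameError: name 'x' is not defined",
    "NameError: name 'sum' is not defined",
    "SyntaxError: invalid syntax",
]
counts = count_errors(raw)
for message, frequency in counts.most_common():
    print(frequency, message)
```

Keeping the rules in a flat list makes the iterative refinement mentioned above cheap: a new rule is one more pattern and replacement pair.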

There is a question of how far one should sanitize. Should 
these two error messages be considered the same? 

RuntimeError: maximum recursion depth exceeded 
while getting the repr of a list 
RuntimeError: maximum recursion depth exceeded 
while getting the repr of a tuple 

Overall we tended to use fewer sanitization rules rather
than more (considering the above to be different); a similar
approach was used in [38]. Conceptually, to fix a single
objective goal for sanitization, we imagined that each category
should uniquely correspond to a single line of source code
of the compiler/runtime where the error is first detected.

Another step in sanitization was to remove any
non-English error messages, to avoid inadvertently seeing the
same patterns repeated in multiple languages, which might
affect the results. This was done by removing all messages
with non-ASCII characters, and manual filtering.

2.1 Overview of Data Sets 

The Python data set yielded 309710 syntax errors and
333538 run-time errors. The Java data set yielded
4002822 compile-time errors and 129650 run-time errors.
Note that the Java data set has a much smaller proportion of
run-time errors than Python (only about 3% rather than
almost half). But to a degree, this difference is inherent in the
language, since many errors that would occur at
compile-time in Java’s strict typing-and-scoping system are not
encountered until run-time in Python.

After sanitization and grouping, the Python data set
yielded 283 distinct error messages. Of these, 17 occurred
exactly twice and 42 occurred only once (for example,
ValueError: Format specifier missing precision
and SyntaxError: can't assign to Ellipsis). The
Java data set yielded 572 error messages in total; 65 occurred
exactly twice and 127 occurred only once (for example,
com.vmware.vim25.InvalidArgument and cannot
create array with type arguments).

Errors are not completely parallel for both languages. For
example, Java allows function overloading, i.e. two
functions with distinct signatures but the same name. In Python,
this must instead be implemented by a single function that
takes different actions depending on the runtime number and
type of its argument(s). It is the function’s responsibility to
generate the error message. It turns out that not all such
functions generate identical messages and so the single Java error
message no suitable method found corresponds to
more than one distinct Python error message:
f argument must be a string or number, not T
and f arg 1 must be a type or tuple of types.

Figure 1. The two data sets for our study. The CS Circles
data set is Python, while BlueJ is Java. The plots are log-log.

The 5 most common Python errors were: 

179624 SyntaxError: invalid syntax 
97186 NameError: name 'NAME' is not defined 
76026 EOFError: EOF when reading a line
26097 SyntaxError: unexpected EOF while parsing
20758 IndentationError: unindent does not match 
any outer indentation level 

The 5 most common Java errors were: 

702102 cannot find symbol - variable NAME 
407776 ';' expected 

280874 cannot find symbol - method NAME 
197213 cannot find symbol - class NAME 
183908 incompatible types 

We plot both data sets in Figure 1. The x-axis measures
the rank of each error message (with 1 being the most
frequent) and the y-axis measures the number of times each
error occurred. Using a logarithmic scale is necessary for the
changes in the y-axis to be visible, and we also use a
logarithmic scale for the x-axis. Notice that both data sets give
rise to similar distributions; in the rest of the paper we will
try to describe them in a common framework.

2.2 Notation 

For any given data set, we will use $N$ to denote the total
number of errors logged, and $M$ for the number of distinct
error types. For example, the Python data set has $N = 643248$
and $M = 283$. Let $F_k$ denote the number of times
that the $k$th-most common error occurred, e.g. $F_1 = 179624$
for Python. We will also write $F_{\max} := F_1$ as an alternate
symbol for the same value, when we wish to emphasize that
it is the maximum frequency. The smallest frequency $F_M$
is 1 for both of our data sets. In lexicography, the items
occurring just once are known as the hapax legomena of the
corpus. An $f$-legomenon is any error message that appears
exactly $f$ times. We will use the symbol

$$\#F^{-1}(f)$$

to denote the number of $f$-legomena. The first few counts of
$f$-legomena in our data sets are listed in Table 1.

#F^{-1}(f) where f is:     1     2     3     4     5     6
Python                    42    17    19    13     6     4
Java                     127    65    31    17    18    15

Table 1. Number of $f$-legomena in each data set.
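
As a rough illustration of this notation, the quantities $N$, $M$, $F_k$ and the $f$-legomena counts can all be read off a table of message counts. The sketch below uses a made-up toy data set, and the function name `summarize` is an assumption for illustration only.

```python
from collections import Counter

def summarize(counts):
    """Compute N, M, the sorted frequency vector F, and f-legomena counts."""
    F = sorted(counts.values(), reverse=True)  # F[0] is F_1 = F_max
    N = sum(F)                                 # total number of errors logged
    M = len(F)                                 # number of distinct error types
    legomena = Counter(F)                      # legomena[f] = number of f-legomena
    return N, M, F, legomena

# Toy data (not from either corpus): message -> number of occurrences.
toy = Counter({"a": 5, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1})
N, M, F, legomena = summarize(toy)
print(N, M, F[0], legomena[1], legomena[2])
```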

3. Power Law Distributions 

When studying frequency counts of different objects, a
discrete power law distribution is one in which the frequency
$F_k$ of the $k$th most common item is proportional to $1/k^\gamma$.
As mentioned in the introduction, power law distributions
provide good fits to many unrelated empirical distributions.
A common example is that in the novel Moby Dick, the
frequency of the $k$th-commonest word is approximately
proportional to a power of $1/k$. “Zipf’s law” is sometimes used
as a synonym for the discrete power law, but sometimes also
refers to the special case $F_k \propto 1/k$ where $\gamma = 1$. There is also
a large body of work on continuous power laws, where one
sorts items by some magnitude that takes on continuous
values, and examines the relationship between rank and
magnitude. See many examples of both types in [?].

Note that sanitization of error messages is particularly
important because of the fact that many natural languages
follow power-law curves. If we did absolutely no sanitization,
then our error message distributions would have significant
aspects determined by the frequency distribution of variable
names chosen by users, and finding a power law describing
the latter would be less surprising, given that natural
language is already known to exhibit power law behaviour.

We now turn to analyzing our data sets from the power
law perspective. Do they approximately satisfy a power law?
This did not appear to be the case: a power law, when plotted
on a log-log scale, should give a straight line, but it is clear
from Figure 1 that this is not an accurate description of our
data sets.

3.1 Zipf-Mandelbrot Distributions 

There is a generalization of power law distributions called
the Zipf-Mandelbrot family of distributions. These
distributions are defined, using two parameters $t$ and $\gamma$, by

$$F_k \propto \frac{1}{(k+t)^\gamma}.$$

Such a sequence should appear linear on a log-log plot
provided that along the x-axis, we plot the logarithmic positions
of $(k+t)$ rather than of $k$. When we tested plotting these
distributions in this modified way, for an appropriate value
of $t$, we obtained a much more persuasive fit: in Figure 2,
which has the shift $t = 60$, the points very nearly fall on a
line. This shift was obtained by trial-and-error, and the line
drawn in has slope $-\gamma = -6.3$. In the rest of the paper we
aim to give a more principled way of estimating $t$ and $\gamma$.
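
To make the shifted log-log claim concrete, here is a small sketch that builds an exact Zipf-Mandelbrot sequence and checks that it is linear in $\log(k+t)$ with slope $-\gamma$. The function names are illustrative assumptions; the parameter values are the trial-and-error ones quoted above, used here on synthetic data rather than on the real corpus.

```python
import math

def zipf_mandelbrot(M, t, gamma):
    """Probabilities p_k proportional to (k + t)^(-gamma) for k = 1..M."""
    weights = [(k + t) ** (-gamma) for k in range(1, M + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def shifted_loglog_slope(p, t):
    """Slope between the first and last points of log p_k versus log(k + t)."""
    M = len(p)
    dx = math.log(M + t) - math.log(1 + t)
    dy = math.log(p[-1]) - math.log(p[0])
    return dy / dx

p = zipf_mandelbrot(M=283, t=60, gamma=6.3)
slope = shifted_loglog_slope(p, t=60)
print(slope)  # close to -gamma for an exact Zipf-Mandelbrot sequence
```

On real data the points only approximately fall on a line, so one would inspect the whole plot rather than two endpoints.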

Is a Zipf-Mandelbrot distribution plausible? Here is one
argument that, if we accept that power laws can arise in
natural settings, there is reason to suspect that
Zipf-Mandelbrot laws can too. It is not meant to give an
exhaustive explanation, just an argument for plausibility.
Suppose we start with a power law, and then coalesce several
items together, i.e., replace several distinct error messages
with a single unified message having the sum of their
frequencies. (In a list of English words, the analogy would
be that a single word has multiple meanings.) The effects
of this message-merging would be twofold: the resulting
new message would be an outlier to the original power law
curve; and the remaining data points, when plotted on a
rank-frequency scale, would be shifted several positions to the
left, i.e. they would follow a Zipf-Mandelbrot distribution
instead of a power law distribution. This is indeed a plausible
scenario for the Python data set! The most common error,
SyntaxError: invalid syntax, is very generic. It
can be obtained by writing two tokens in a row (such as
forgetting a comma or quote marks), using an assignment
statement in place of a conditional expression (such as using
if a=b: instead of using ==), by mismatching
parentheses, etc.
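
The merging argument can be illustrated with a toy computation. In this sketch (all names and numbers are illustrative assumptions), the $m$ most frequent items of an exact power law are merged into one; the remaining items then match a Zipf-Mandelbrot sequence with shift $t = m - 1$, while the merged item sits above the curve as an outlier.

```python
def merge_top(freqs, m):
    """Merge the m most frequent items of a rank-ordered list into one item."""
    merged = sum(freqs[:m])
    return [merged] + freqs[m:]

gamma, m = 2.0, 5
power_law = [k ** (-gamma) for k in range(1, 101)]  # F_k proportional to k^-gamma
merged = merge_top(power_law, m)

# After merging, the item at new rank r >= 2 was originally at rank r + (m - 1),
# so it follows (r + t)^(-gamma) with shift t = m - 1.
t = m - 1
shifted = [(r + t) ** (-gamma) for r in range(2, len(merged) + 1)]

print(merged[1], shifted[0])          # identical tail values
print(merged[0], (1 + t) ** (-gamma)) # rank 1 is far above the shifted curve
```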

3.2 Consequences of a Zipf-Mandelbrot model 

What behaviour does a Zipf-Mandelbrot model predict? It
postulates that there is some innate ordering of error
messages, from most frequent to least frequent, so that the
inherent probability $F^*_\ell$ of the $\ell$th most frequent error message
is proportional to $(\ell+t)^{-\gamma}$. The reason that we use the
subscript $\ell$ here is that our concrete data set arrives via sampling
from the inherent distribution. So like a sampling error, the
observed ordering of messages from most to least frequent
is not necessarily the exact same as the innate ordering.

Figure 2. The Python data set with the (pre-logarithmic)
x-axis shifted by $t = 60$, and a straight line with slope
$-\gamma = -6.3$ that approximately fits most of the data.

An interesting aspect of this model is that it assumes
$F^*_\ell \propto (\ell+t)^{-\gamma}$ continues to hold for arbitrarily large $\ell$.
Can this be plausible: is the total number of possible
errors infinite? We will accept this as a reasonable hypothesis,
which if not literally true, could continue long enough to
be consistent with the size of any measured data set, for the
following reason, using Python as an example. The most
common errors we see are the ones in the Python core code
(the syntax errors, and runtime errors from the “builtins”).
Less frequently we start to see errors from Python
modules, such as ValueError: math domain error within the
sqrt function of the math module. While this is the only
module taught on the site, users have occasionally
submitted code using other common modules like time, random
and functools, each of which comes with its own
specific errors. Moving on, there would be errors from
rarely-used modules, then even after this, modules that users may
with more or less frequency import (or copy in) themselves.
For example, we observed a mainfile: error: must
provide name of pdb file error caused by someone
who copied in a Python program for use with the X-PLOR
biomolecular structure determination software [?]. The
same phenomenon happens in the Java data set. So errors
with arbitrarily small inherent frequency are not
unreasonable despite the finite size of the languages.

4. Analysis 

To fit our data to a Zipf-Mandelbrot distribution, several
approaches are possible. For power laws, the naive approach,
using a least-squares fit to a linear log-log plot (cf. Figure 1),
is known to introduce errors [?]. Rather, we will follow
Newman [30], who considered maximum likelihood
estimation methods for power laws. Some work will be needed to
extend this to Zipf-Mandelbrot distributions.

The method in [30] involves a particular way of
processing the data; let us mention the motivation. The direct
approach to maximum likelihood estimation would be to
determine the $\gamma$ and $t$ that maximize
$\prod_{k=1}^{M} \left(C/(k+t)^{\gamma}\right)^{F_k}$, where
$C$ is the normalizing constant with $C \cdot \sum_{k} (k+t)^{-\gamma} = 1$.
Izsak [?] suggests this. But trying this approach gives
unsatisfactory results with any of our data sets: the curves
produced fit the data very poorly except in the regime of
$F_1$. The calculation goes wrong because it is too heavily
biased by the highest-frequency errors. (For Zipf-Mandelbrot
in particular, if the argument in Section 3.1 were to be true,
then it should be no surprise that fitting to $F_1$ would be
problematic, since $F_1$ would be an outlier from the norm.)
Also, the most likely fit entails that the innate order
exactly matches the observed frequency-ordering of error
messages [?], which is itself unlikely.


4.1 Probabilities of Frequencies 

This motivates the maximum likelihood method on
frequencies [5, 30]. It starts by taking a different view of the data set.
Using the Python data set as a concrete example, we imagine
the frequency vector $F = (179624, 97186, \ldots, 1, 1)$ itself as
being an unordered set of $M$ data points from a
parameterized distribution: given a new error message, how frequent
is it? This distribution-on-frequencies is a transformed
version of the inherent distribution $F^*$, and also depends on the
data set size. The goal, then, is to choose the parameters so
as to maximize the likelihood of observing $F$.

The analysis in [5, 30] primarily achieves rigor for
continuous distributions. For discrete distributions, it turns out
that the distribution-on-frequencies is actually given by
another distribution which seems to have been first studied by
Evert [8]. To describe it we recall the $\Gamma$ function, which is
the (shifted) analytic continuation of the factorial function,
satisfying $\Gamma(n) = (n-1)!$ at positive integer values and
$\Gamma(n) = (n-1)\Gamma(n-1)$ on its whole domain. The beta
function, another standard function, is a continuous analogue of
the binomial coefficient, defined by

$$B(x, y) := \Gamma(x)\Gamma(y)/\Gamma(x+y).$$

Then, finally, the Evert distribution is the frequency
distribution, parameterized by one parameter $\alpha$, defined by

$$\text{frequency of } f \propto B(f+1-\alpha, \alpha).$$

It is involved in our analysis for the following reason: 

Proposition 1. Suppose that we draw samples from a
discrete Zipf-Mandelbrot distribution with parameters $\gamma$ and $t$.
If the number of samples is large, then for all small $f$, the
expected number of $f$-legomena is proportional to
$B(f+1-\alpha, \alpha)$ where $\alpha = 1 + 1/\gamma$.

Paraphrasing, this says that the distribution-on-frequencies
for a discrete Zipf-Mandelbrot distribution is the Evert
distribution. This result was obtained by Evert [8], though he
expressed it in terms of “type density functions.” We
re-prove it in Appendix A.

We remark that the proof of Proposition 1 remains valid
even if the discrete Zipf-Mandelbrot distribution is perturbed
by altering some of the highest probabilities, which ensures
that it is still valid even if outliers à la Section 3.1 occur.

4.1.1 Remarks 

In [5, 30], the focus of the analysis is on continuous power
law distributions, and for that, the analogue of Proposition 1
is to use a simple power law with exponent $\alpha$ instead of an
Evert distribution. Though the Evert distribution is not
mentioned in [30], it is remarked that another distribution, the
Yule distribution, is “an alternative and often more
convenient form” of the discrete power law. These conveniences
are mathematical in nature: the normalizing constant,
expectation, variance, and moments of the Yule distribution have
nicer closed forms than a pure power law. (And it is
reasonable to use in power law analysis because up to a scaling
factor, the Yule distribution becomes a discrete power law in the
limit.) These conveniences hold for the Evert distribution
too, since Evert and Yule differ only by a shift. For instance,
our fitting code utilizes the identity

$$\sum_{f=1}^{F_{\max}} B(f+1-\alpha, \alpha) = B(2-\alpha, \alpha-1) - B(F_{\max}+2-\alpha, \alpha-1).$$
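
The summation identity used by the fitting code follows from telescoping $B(x, y) - B(x+1, y) = B(x, y+1)$, and it can be checked numerically. The sketch below is not the paper's Maple code; it is a minimal Python check using only the standard library, with helper names chosen for illustration and a small upper limit in place of $F_{\max}$.

```python
import math

def beta(x, y):
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y), via log-gamma."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))

def evert_sum_direct(alpha, f_max):
    """Sum of B(f + 1 - alpha, alpha) for f = 1..f_max, term by term."""
    return sum(beta(f + 1 - alpha, alpha) for f in range(1, f_max + 1))

def evert_sum_closed(alpha, f_max):
    """The same sum via the closed form on the right-hand side of the identity."""
    return beta(2 - alpha, alpha - 1) - beta(f_max + 2 - alpha, alpha - 1)

# Any alpha strictly between 1 and 2 works; f_max is kept small for speed.
alpha, f_max = 1.165, 1000
direct = evert_sum_direct(alpha, f_max)
closed = evert_sum_closed(alpha, f_max)
print(direct, closed)
```

The closed form matters in practice because $F_{\max}$ is in the hundreds of thousands for both data sets, so summing term by term inside an optimization loop would be comparatively slow.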

4.2 Maximum Likelihood 

The Evert distribution allows us to compute the most likely
value of $\alpha$ for the collection of frequencies $F$. With $C^\alpha$
denoting the normalizing constant that satisfies
$C^\alpha \cdot \sum_{f=1}^{F_{\max}} B(f+1-\alpha, \alpha) = 1$, we seek the $\alpha$ that maximizes

$$\prod_{k=1}^{M} C^\alpha \, B(F_k+1-\alpha, \alpha).$$

This can be determined numerically using binary search,
using logarithms since the numbers involved are very small.
Then we determine the parameter $\gamma$ using $\gamma = 1/(\alpha-1)$.
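
A minimal sketch of this maximization is given below. It is not the paper's implementation: it swaps the binary search for a golden-section search on the log-likelihood (assuming the log-likelihood is unimodal in $\alpha$ on $(1, 2)$), it takes the data as a hypothetical dictionary of $f$-legomena counts, and it is exercised on synthetic counts rather than on either corpus.

```python
import math

def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def log_likelihood(alpha, legomena):
    """Log-likelihood of alpha; legomena maps f -> number of messages seen f times."""
    f_max = max(legomena)
    # Normalizer sum_{f=1}^{f_max} B(f + 1 - alpha, alpha), via the telescoping identity.
    norm = math.exp(log_beta(2 - alpha, alpha - 1)) - math.exp(
        log_beta(f_max + 2 - alpha, alpha - 1))
    return sum(n * (log_beta(f + 1 - alpha, alpha) - math.log(norm))
               for f, n in legomena.items())

def fit_alpha(legomena, lo=1.001, hi=1.999, iters=80):
    """Golden-section search for the maximizing alpha (assumes unimodality)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, legomena) > log_likelihood(d, legomena):
            b = d   # maximum lies in [a, d]
        else:
            a = c   # maximum lies in [c, b]
    return (a + b) / 2

# Synthetic check: counts proportional to an Evert distribution with alpha = 1.3.
true_alpha, f_max = 1.3, 200
legomena = {f: round(1e6 * math.exp(log_beta(f + 1 - true_alpha, true_alpha)))
            for f in range(1, f_max + 1)}
alpha_hat = fit_alpha(legomena)
gamma_hat = 1 / (alpha_hat - 1)
print(alpha_hat, gamma_hat)
```

Grouping the product by frequency value (one term per distinct $f$, weighted by the number of $f$-legomena) gives the same likelihood as the product over $k = 1, \ldots, M$ while touching far fewer terms.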

The only remaining issue is how to determine the value
of the shift parameter $t$ that has maximum likelihood.
Proposition 1 does not help since $t$ plays no role in its conclusion.
(The reason for this apparent paradox is that the
approximation guarantee of Proposition 1 is only valid for small
frequencies.) Nonetheless, we can determine a value for $t$ using
some ideas from the analysis of the continuous case [5, 30].†

Proposition 2. Let $\alpha = 1 + 1/\gamma$. Suppose we draw a
sample from the bounded continuous power-law distribution
with exponent $-\alpha$ and domain $(1, (1+(M+1)/t)^{\gamma})$. Then the
$(M+1)$-quantiles of this random variable are proportional
to $(t+1)^{-\gamma}, (t+2)^{-\gamma}, \ldots, (t+M)^{-\gamma}$. Furthermore, the
choice of $t$ that maximizes the likelihood of observing $F$ is
$t = (M+1)/(F_{\max}^{1/\gamma} - 1)$.

The first conclusion says that a “typical” draw of $M$
items from this continuous distribution-on-“frequencies” is
a model for the Zipf-Mandelbrot distribution. The second
conclusion gives us the rule that we use to compute $t$ in our
statistical fitting. We prove the proposition in Appendix B.
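
The rule for $t$ from Proposition 2 is a one-line computation once $\alpha$ is known. The sketch below (the function name is an assumption) plugs in the values of $M$ and $F_{\max}$ reported in Section 2 together with the maximum-likelihood $\alpha$ values from Table 2; the results should land close to the corresponding $t$ values in that table, up to rounding of $\alpha$.

```python
def shift_parameter(alpha, M, f_max):
    """Shift t = (M + 1) / (F_max^(1/gamma) - 1), with gamma = 1 / (alpha - 1)."""
    gamma = 1.0 / (alpha - 1.0)
    return (M + 1) / (f_max ** (1.0 / gamma) - 1.0)

# M and F_max as reported in Section 2; alpha rounded as printed in Table 2.
t_python = shift_parameter(alpha=1.165, M=283, f_max=179624)
t_java = shift_parameter(alpha=1.216, M=572, f_max=702102)
print(t_python, t_java)
```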

5. Fitting the Data 

In Evert’s paper [8], rather than using maximum-likelihood,
he proposes estimating $\alpha$ using a Chi-squared test on the first
few $\#F^{-1}(1), \#F^{-1}(2), \ldots$ values. This approach is
implemented by the R library zipfR of Evert and Baroni [9].

We fit our data sets to the Zipf-Mandelbrot family of
distributions, using both the Chi-squared approach, and the
maximum-likelihood method of Propositions 1 and 2
(implemented in Maple). The results of the fitting are shown in
Table 2. The fit for the Python data set improved greatly by
treating the three most common errors as outliers (cf.
Figure 2). In Figures 3 and 4 we show shifted log-log plots of
the observed fits (the Python plot omits the outliers). For the
Python-without-outliers data set, both methods give a good
fit. For the Java data set, the maximum-likelihood method
gives a significantly better fit than the Chi-squared method.

† The authors of [5, 30] note that the continuous model reasonably
resembles the discrete model when thinking about larger frequencies; the smallest
continuous variables are the ones that would have to be distorted the most
in order to become quantized. Thus, the inaccuracies of Proposition 2 are
complementary to those of Proposition 1.

               Max. likelihood           Chi-squared min.
Java           α = 1.216, t = 33.1       α = 1.225, t = 25.7
Python         α = 1.165, t = 44.7       α = 1.143, t = 65.7
Python^{w/o}   α = 1.131, t = 99.8       α = 1.133, t = 92.9

Table 2. Results of fitting our data sets to Zipf-Mandelbrot
distributions with both methods. Python^{w/o} indicates the
Python data set with the 3 commonest messages removed.

Figure 3. Plot of the Python data set on shifted log-log
axes, with shift $t$ from maximum likelihood estimation.
Red: observed frequencies; blue: Chi-squared fitted
Zipf-Mandelbrot distribution; green: maximum likelihood fitted
Zipf-Mandelbrot distribution.

Figure 4. Plot of the Java data set, analogous to Figure 3.


















6. Future Work 




A few very short questions for future work are: (1) can the
good fit be replicated in other Java/Python systems? (2) if so,
what properties of the user base or programming ecosystem
affect the $\alpha$ and $t$ parameters? (3) do error messages in other
languages also follow a Zipf-Mandelbrot distribution?

In the context of the hypothetical extreme languages of
Section 1.1, Python’s slightly smaller value of $\alpha$ suggests
that it tends to give more distinctive error messages. Is it
actually giving more information in its errors? Could it
alternatively be explained by artefacts like the non-parallelism
mentioned in Section 2.1?

It would be interesting to re-analyze the discrete data
sets in [30] using the Evert maximum likelihood method.
Specifically, this could be done for the data sets for word
frequency, web hits, telephone calls, and citations, which are
discrete distributions coming from a population that is large
enough to be effectively infinite. Additionally, it would be
interesting to apply the Kolmogorov-Smirnov test suggested
in [?] to the Evert maximum likelihood method, to be more
rigorous in our approach.

From a more practical perspective, it would not be hard,
and of great potential benefit, to release a systematic data
set of good beginner-friendly explanations of the top errors
in different programming languages. Further work could try
to quantify whether this improves the ability of beginner students
to program independently.

A. Proof of Proposition 1

Fix a constant $f$ and consider $N$ as growing to infinity. By
linearity of expectation, the expected number $E[\#F^{-1}(f)]$
of $f$-legomena is equal to the sum, over all $\ell$, of the
probability that word $\ell$ occurs exactly $f$ times in our sample.
For large $N$, any word with frequency bigger than a
constant has vanishingly small probability of occurring only $f$
times. So for a word that may become an $f$-legomenon, its
number of occurrences is well-approximated by a Poisson
random variable, since it is a sum of many Bernoulli
random variables, each with a small individual expectation. The
expected number of occurrences of the $\ell$-th most common
word is $NC(\ell+t)^{-\gamma}$, so the number of times we observe it
is a Poisson variable with expectation $NC(\ell+t)^{-\gamma}$.

This means that for any constant $f$, by the definition of a
Poisson variable,

$$\Pr[\text{word } \ell \text{ appears exactly } f \text{ times}] = \frac{(NC(\ell+t)^{-\gamma})^f}{f!\,\exp(NC(\ell+t)^{-\gamma})}.$$

Thus, the expected number of words appearing $f$ times is

$$E[\#F^{-1}(f)] = \sum_{\ell \ge 1} \frac{(NC(\ell+t)^{-\gamma})^f}{f!\,\exp(NC(\ell+t)^{-\gamma})}.$$

We approximate this infinite sum with the infinite integral

$$E[\#F^{-1}(f)] \approx \int_{1}^{\infty} \frac{(NC(x+t)^{-\gamma})^f}{f!\,\exp(NC(x+t)^{-\gamma})}\,dx.$$

To evaluate it, we substitute $y = NC(x+t)^{-\gamma}$, i.e.
$x = (NC/y)^{1/\gamma} - t$ and so
$dx = -\frac{(NC)^{1/\gamma}}{\gamma}\, y^{-1/\gamma-1}\,dy$, giving

$$E[\#F^{-1}(f)] \approx \frac{(NC)^{1/\gamma}}{\gamma\, f!} \int_{0}^{NC(1+t)^{-\gamma}} \frac{y^{f-1/\gamma-1}}{e^{y}}\,dy.$$

Again assuming $N$ large, the above integral is well-approximated
by replacing the upper bound by $+\infty$. Therefore, taking the
terms that do not depend on $f$ into the constant of
proportionality, we find that

$$E[\#F^{-1}(f)] \propto \frac{1}{f!} \int_{0}^{+\infty} \frac{y^{f-1/\gamma-1}}{e^{y}}\,dy = \frac{\Gamma(f-1/\gamma)}{\Gamma(f+1)} \propto B(f-1/\gamma,\, 1+1/\gamma) = B(f+1-\alpha, \alpha).$$


B. Proof of Proposition 2

Let $U(a,b)$ denote a random variable from the uniform
distribution on $(a, b)$. Our starting observation is that the
continuous power-law distribution with exponent $-\alpha$ and
unbounded domain $(1, +\infty)$ is identical in distribution to
$U(0,1)^{-\gamma}$. See, for instance, [5, App. D].

Therefore, adding the bound to get the continuous
power-law in the hypothesis of the theorem, said distribution is
identical in distribution to $U(t/(t+M+1), 1)^{-\gamma}$.

The $(M+1)$-quantiles of $U(t/(t+M+1), 1)^{-\gamma}$ are
$((t+k)/(t+M+1))^{-\gamma}$ for $k = 1, \ldots, M$, so the first conclusion
follows.

Finally, the smaller the domain $(1, (1+(M+1)/t)^{\gamma})$, the
larger the probability density function at the observed $F$
values, except that we need $(1+(M+1)/t)^{\gamma} \ge F_{\max}$ for $F_{\max}$
to be observable at all. This proves the second conclusion.













